Visual Web Information Extraction with Lixto
نویسندگان
چکیده
We present new techniques for supervised wrapper generation and automated web information extraction, and a system called Lixto implementing these techniques. Our system can generate wrappers which translate relevant pieces of HTML pages into XML. Lixto, of which a working prototype has been implemented, assists the user to semi-automatically create wrapper programs by providing a fully visual and interactive user interface. In this convenient user-interface very expressive extraction programs can be created. Internally, this functionality is reflected by the new logicbased declarative language Elog. Users never have to deal with Elog and even familiarity with HTML is not required. Lixto can be used to create an “XML-Companion” for an HTML web page with changing content, containing the continually updated XML translation of the relevant information.
منابع مشابه
Declarative Information Extraction, Web Crawling, and Recursive Wrapping with Lixto
Lixto is a system and method for the visual and interactive generation of wrappers for Web pages under the supervision of a human developer, for automatically extracting information from Web pages using such wrappers, and for translating the extracted content into XML. This paper describes some advanced features of Lixto, such as disjunctive pattern definitions, specialization rules, and Lixto’...
متن کاملWeb Information Acquisition with Lixto Suite: A Demonstration∗
We demonstrate the Lixto Suite, a web data extraction and transformation software kit for retrieving and converting information from various sources to various customer devices. With the Lixto Suite, non-technical content managers can rapidly develop applications in the areas of M-Commerce, E-Commerce, content integration and corporate portals.
متن کاملAnnotating the Legacy Web with Lixto
Introduction The Semantic Web is still a vision. The unstructured Web of today contains millions of documents which cannot be queried and where layout and structure are heavily mixed. Moreover, they are not annotated at all. There is a huge gap between Web information and the qualified, structured data as required in corporate information systems. According to the vision of the Semantic Web, al...
متن کاملThe Lixto Project: Exploring New Frontiers of Web Data Extraction
The Lixto project is an ongoing research effort in the area of Web data extraction. Whereas the project originally started out with the idea to develop a logic-based extraction language and a tool to visually define extraction programs from sample Web pages, the scope of the project has been extended over time. Today, new issues such as employing learning algorithms for the definition of extrac...
متن کاملSupervised Wrapper Generation with Lixto
We illustrate basic features of the Lixto wrapper generator such as the user and system interaction, the capacious visual interface, the marking and selecting procedures, and the extraction tasks by describing the construction of a simple example program in the current Lixto prototype.
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2001